Which chemical properties influence the quality of red wines?

Austin J. Alexander


Check out the data

What is the structure of your dataset?

There are 1,599 observations of red wines with 12 recorded features for each observation. Some of the features are related to each other (e.g., those related to acidity). Quality is the only discrete feature: an integer score that ranges from 3 to 8 in this sample.

show column names
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
basic stats for each column
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
check for any NaN values
## [1] 0
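
A minimal base R sketch of the checks above. The data frame `toy` is made up for illustration (it is not the wine data), and `describe_data` is a hypothetical helper name, not something from the original analysis.

```r
# Hypothetical helper: collect the basic structure checks into one list.
describe_data <- function(df) {
  list(
    n_obs     = nrow(df),       # number of observations
    n_feat    = ncol(df),       # number of recorded features
    columns   = names(df),      # column names
    n_missing = sum(is.na(df))  # count of NA/NaN cells
  )
}

# Made-up three-row data frame that borrows column names from the dataset.
toy <- data.frame(
  alcohol = c(9.4, 9.8, NA),
  pH      = c(3.51, 3.20, 3.26),
  quality = c(5L, 5L, 6L)
)

info <- describe_data(toy)
print(info)
print(summary(toy))  # per-column min, quartiles, mean, max
```

On the real data, the same calls would be made on the full data frame instead of `toy`.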

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

At this point, it was unclear which features would be useful.

Did you create any new variables from existing variables in the dataset?

No, there didn’t seem to be much of a need to create new variables.

check the distribution of quality
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

As can be seen above, the vast majority of wines are scored 5 or 6.

proportions

3’s and 4’s

## [1] 0.03939962

5’s and 6’s

## [1] 0.8248906

7’s and 8’s

## [1] 0.1357098

Scores of 5 and 6 account for over 82% of all scores! This suggests that the most useful information might be found by examining the lowest and highest scorers, but we’ll save that for later.
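
The three proportions can be recomputed directly from the quality counts printed above. This is a small sketch that types the counts in by hand rather than reading the data file; the variable names are illustrative.

```r
# Counts per quality score, copied from the table above.
counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)
total  <- sum(counts)  # should equal the 1,599 observations

low  <- sum(counts[c("3", "4")]) / total  # 3's and 4's
mid  <- sum(counts[c("5", "6")]) / total  # 5's and 6's
high <- sum(counts[c("7", "8")]) / total  # 7's and 8's

print(c(low = low, mid = mid, high = high))
```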

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

None of the features seemed unusual enough to explore further, and no, I didn’t notice any need to tidy or adjust the form of the data at this point.

Univariate Plots Section

visualize quality

Reason for this plot: I wanted to visualize the distribution of quality scores.

Comments: Clearly, most wines were given mid-range scores.

visualize alcohol

visualize volatile.acidity

visualize sulphates

Reason for these plots: I wanted to visualize the distribution of each of these three features, which later plots single out as the most informative. (Those later plots originally came first but had to be moved to later sections to meet the project requirements.)

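Comments: None of the three distributions looks normal. The summary statistics above (each mean exceeds its median, and each maximum sits far above the third quartile) are consistent with right skew.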


Bivariate Plots Section

Examine correlations:

Reason for these plots: I wanted to visualize the relationships among the features.

Comments: See below.

What is/are the main feature(s) of interest in your dataset?

Alcohol, volatile.acidity, and sulphates seem to be the features most highly correlated with quality scores as they present correlation coefficient values furthest from zero, so let’s examine them further.
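
One way to rank features by how far their correlation with quality sits from zero is sketched below in base R. The helper `rank_by_correlation` and the data frame `toy` are assumptions made for illustration; the values in `toy` are invented and are not the wine data.

```r
# Hypothetical helper: Pearson correlation of every other column with
# `target`, sorted by absolute value (strongest first).
rank_by_correlation <- function(df, target = "quality") {
  others <- setdiff(names(df), target)
  r <- sapply(others, function(col) cor(df[[col]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Made-up data for illustration only.
toy <- data.frame(
  alcohol          = c(9.0, 9.5, 10.0, 11.0, 12.0, 12.5),
  volatile.acidity = c(0.90, 0.70, 0.65, 0.50, 0.40, 0.30),
  pH               = c(3.3, 3.1, 3.4, 3.2, 3.5, 3.2),
  quality          = c(3, 4, 5, 6, 7, 8)
)

ranked <- rank_by_correlation(toy)
print(round(ranked, 3))
```

Sorting on the absolute value keeps strongly negative correlations (as with volatile.acidity) near the top alongside strongly positive ones.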

The following graphs each display the distribution of wines by one of these three features (e.g., alcohol), split by quality score to make apparent any visible relationship between the feature value (e.g., high alcohol) and the quality score.

facet alcohol by quality

facet volatile.acidity by quality

facet sulphates by quality

Reason for these plots: I wanted to visualize the distribution of each of these three categories as they related to quality scores.

Comments: See below.

Each of the plots above suggests that the feature shown (alcohol, volatile.acidity, or sulphates) is related to the quality score, particularly at the lowest and highest scores.

group reds by quality

Aside from an unexplained dip, mean alcohol content rises roughly linearly with quality score. The quantile lines tell a similar story.

quality vs alcohol mean

quality vs alcohol with quantile lines

Conversely, as the volatile.acidity mean (and each quantile) increases, quality scores decrease.

quality vs volatile.acidity mean

quality vs volatile.acidity with quantile lines

The sulphate relationship with quality scores mirrors the one with alcohol.

quality vs sulphates mean

quality vs sulphates with quantile lines

Reason for these plots: I wanted to visualize the relationship of the average values of each category to the quality scores.

Comments: See below.
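
The grouping behind the mean and quantile plots above can be sketched in base R as follows. The helper name `summarise_by_quality` and the data frame `toy` are hypothetical; the numbers in `toy` are invented for illustration and are not taken from the wine data.

```r
# Hypothetical helper: mean, median and quartiles of one feature
# for each quality score.
summarise_by_quality <- function(df, feature) {
  groups <- split(df[[feature]], df$quality)
  stats <- t(sapply(groups, function(x) {
    c(mean   = mean(x),
      q25    = unname(quantile(x, 0.25)),
      median = median(x),
      q75    = unname(quantile(x, 0.75)),
      n      = length(x))
  }))
  data.frame(quality = as.integer(rownames(stats)), stats, row.names = NULL)
}

# Made-up data for illustration only.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.0, 10.5, 11.0, 11.2, 11.6, 12.0)
)

by_quality <- summarise_by_quality(toy, "alcohol")
print(by_quality)
```

Keeping the group size `n` next to each statistic is useful here because the lowest and highest scores have very few wines, so their means and quantiles are noisier.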

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Simply put: alcohol, volatile.acidity, and sulphates (particularly the first two) appear to have an effect on the quality scores. Alcohol is discussed below; in general, the lower the volatile acidity, the higher the quality score, and the opposite holds for sulphates.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I did not spend time looking at the other features because I’m focusing on answering the primary question driving this project.

What was the strongest relationship you found?

Alcohol. Funnily enough, a higher alcohol content seems to encourage a higher score.


Multivariate Plots Section

As a continuation of the exploration of the three features above, the boxplots below show that the median value of each feature (e.g., alcohol) tracks the quality score closely.

boxplot of alcohol colored by quality

boxplot of volatile.acidity colored by quality

boxplot of sulphates colored by quality

Reason for these plots: I wanted to visualize the relationship of the distribution and quantile values of each category to the quality scores.

Comments: There seems to be a definite relationship between each of these category values and the quality scores.

alcohol vs volatile.acidity faceted by quality

alcohol vs sulphates faceted by quality

Reason for these plots: I wanted to visualize the three-way relationship between alcohol, a second feature (volatile.acidity or sulphates), and the quality scores.

Comments: As a result of the two faceted plots above, a general observation may be made: low volatile.acidity, high sulphates, and high alcohol content appear to be related to high quality scores.

alcohol vs volatile.acidity colored by quality

now only 3’s, 4’s, 7’s, and 8’s

now only 3’s and 8’s

alcohol vs sulphates

now only 3’s and 8’s

Reason for these plots: I wanted to visualize the three-way relationship between alcohol, a second feature (volatile.acidity or sulphates), and the quality scores. In addition, I wanted to remove certain quality scores to see whether any clustering would become visible.

Comments: I see pretty solid visual evidence of clustering, further confirming the relationship between these three categories and quality scores.
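
Restricting the scatterplots to particular quality scores amounts to a row filter. A minimal base R sketch, assuming a hypothetical helper `keep_scores` and an invented data frame `toy` (not the wine data):

```r
# Hypothetical helper: keep only the rows whose quality score is in `scores`.
keep_scores <- function(df, scores) {
  df[df$quality %in% scores, , drop = FALSE]
}

# Made-up data for illustration only.
toy <- data.frame(
  quality          = c(3, 4, 5, 6, 7, 8),
  alcohol          = c(9.0, 9.6, 9.9, 10.6, 11.5, 12.1),
  volatile.acidity = c(0.88, 0.69, 0.58, 0.50, 0.40, 0.42)
)

tails    <- keep_scores(toy, c(3, 4, 7, 8))  # drop the mid-range 5's and 6's
extremes <- keep_scores(toy, c(3, 8))        # lowest and highest scores only
print(extremes)
```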

alcohol vs volatile.acidity with only contour lines colored by quality showing quality clusters

Reason for these plots: Continuing my investigation of clustering.

Comments: The contour lines help to confirm the clustering visually. Coloring the contour lines by quality score makes the clustering by score fairly obvious.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

As my examination continued, I felt better and better about the apparent relationship between alcohol, volatile.acidity, and sulphates and quality scores.

Were there any interesting or surprising interactions between features?

I found it interesting that sulphate levels seem to have a sweet spot when it comes to quality scores.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

attempts at a simple linear regression model
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = reds)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = reds)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = reds)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = reds)
## m5: lm(formula = quality ~ alcohol:volatile.acidity:sulphates, data = reds)
## m6: lm(formula = quality ~ alcohol * volatile.acidity * sulphates, 
##     data = reds)
## 
## ===================================================================================================
##                                            m1        m2        m3        m4        m5        m6    
## ---------------------------------------------------------------------------------------------------
## (Intercept)                              1.875***  3.095***  2.611***  2.611***  5.763***   1.285  
##                                         (0.175)   (0.184)   (0.196)   (0.196)   (0.058)    (2.188) 
## alcohol                                  0.361***  0.314***  0.309***  0.309***             0.426* 
##                                         (0.017)   (0.016)   (0.016)   (0.016)              (0.209) 
## volatile.acidity                                  -1.384*** -1.221*** -1.221***             9.044* 
##                                                   (0.095)   (0.097)   (0.097)              (4.030) 
## sulphates                                                    0.679***  0.679***             2.713  
##                                                             (0.101)   (0.101)              (3.226) 
## alcohol x volatile.acidity x sulphates                                          -0.036*     1.524* 
##                                                                                 (0.015)    (0.593) 
## alcohol x volatile.acidity                                                                 -0.996* 
##                                                                                            (0.389) 
## alcohol x sulphates                                                                        -0.184  
##                                                                                            (0.309) 
## volatile.acidity x sulphates                                                              -15.622* 
##                                                                                            (6.130) 
## ---------------------------------------------------------------------------------------------------
## R-squared                                   0.227     0.317     0.336     0.336     0.003     0.351
## adj. R-squared                              0.226     0.316     0.335     0.335     0.003     0.349
## sigma                                       0.710     0.668     0.659     0.659     0.806     0.652
## F                                         468.267   370.379   268.912   268.912     5.414   123.160
## p                                           0.000     0.000     0.000     0.000     0.020     0.000
## Log-likelihood                          -1721.057 -1621.814 -1599.384 -1599.384 -1923.929 -1580.453
## Deviance                                  805.870   711.796   692.105   692.105  1038.644   675.909
## AIC                                      3448.114  3251.628  3208.768  3208.768  3853.857  3178.905
## BIC                                      3464.245  3273.136  3235.654  3235.654  3869.988  3227.300
## N                                        1599      1599      1599      1599      1599      1599    
## ===================================================================================================
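
The formulas for m5 and m6 in the table above differ only in the operator that joins the three features, yet they fit very different models. The sketch below uses `terms()` to list the terms each formula expands to; it needs no data. Note also that m3 and m4 share one formula in the table, which is why their columns are identical.

```r
# `:` requests only the interaction term itself.
f5 <- quality ~ alcohol:volatile.acidity:sulphates
# `*` requests the main effects plus every interaction among them.
f6 <- quality ~ alcohol * volatile.acidity * sulphates

terms5 <- attr(terms(f5), "term.labels")
terms6 <- attr(terms(f6), "term.labels")

print(terms5)  # a single three-way interaction term
print(terms6)  # three main effects, three two-way terms, one three-way term
```

Because m5 contains only the three-way product and no main effects, it is not a fair test of whether the three features interact, which may help explain its R-squared of 0.003.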

Since model 6 had the highest R^2 value, I tested it with some obvious extreme cases (based on what seems to have been discovered above) using only values for alcohol, volatile.acidity, and sulphates:

  1. A fake good red was created using a high alcohol content, low volatile.acidity, and average (median) sulphates level and is expected to have a high quality score.
fake good red quality score prediction (95% prediction interval)
##       fit    lwr      upr
## 1 7.43747 6.1107 8.764241
  2. A fake mid red was created using an average (median) alcohol content, average (median) volatile.acidity, and average (median) sulphates level and is expected to have a mid-range quality score.
fake mid-range red quality score prediction (95% prediction interval)
##        fit      lwr      upr
## 1 5.538351 4.259407 6.817296
  3. A fake bad red was created using a low alcohol content, high volatile.acidity, and low sulphates level and is expected to have a low quality score.
fake bad red quality score prediction (95% prediction interval)
##        fit      lwr      upr
## 1 4.845372 3.271417 6.419327

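As the investigation so far suggests, the predictions fall in the expected order: about 7.4 for the good red, 5.5 for the mid-range red, and 4.8 for the bad red. Each interval is wide, though, spanning more than two quality points, and model 6 explains only about 35% of the variance in quality (R^2 = 0.351), so these results are encouraging rather than conclusive.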

Strength of this model: it works for the obvious cases. Weakness of this model: it’s unclear how robust it is.
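
As a sanity check, the mid-range prediction can be roughly reproduced by hand from the m6 coefficients printed in the table and the medians in the summary statistics. This is a sketch under two assumptions: that the fake mid red used exactly the medians 10.2, 0.52, and 0.62, and that the three-decimal rounding of the coefficients is the only source of discrepancy. The variable names are illustrative.

```r
# m6 coefficients as printed in the table (rounded to three decimals).
b <- c(intercept = 1.285, alcohol = 0.426, va = 9.044, sulphates = 2.713,
       alc_va_sul = 1.524, alc_va = -0.996, alc_sul = -0.184, va_sul = -15.622)

# Assumed inputs: the medians from the summary statistics.
alcohol   <- 10.2
va        <- 0.52   # volatile.acidity
sulphates <- 0.62

# Point prediction: main effects plus all interaction terms.
fit <- b[["intercept"]] +
  b[["alcohol"]]    * alcohol +
  b[["va"]]         * va +
  b[["sulphates"]]  * sulphates +
  b[["alc_va_sul"]] * alcohol * va * sulphates +
  b[["alc_va"]]     * alcohol * va +
  b[["alc_sul"]]    * alcohol * sulphates +
  b[["va_sul"]]     * va * sulphates

# Approximate 95% half-width for a single new wine: t quantile times the
# residual sigma (0.652), with 1599 observations and 8 coefficients.
half_width <- qt(0.975, df = 1599 - 8) * 0.652

print(c(fit = fit, lwr = fit - half_width, upr = fit + half_width))
```

The hand-computed fit lands within a few thousandths of the reported 5.538, and the half-width is close to the reported 5.538 - 4.259 = 1.279. A half-width of roughly two residual standard deviations is what a prediction interval for a single wine looks like; a confidence interval for the mean at the medians of 1,599 observations would be far narrower.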


Final Plots and Summary

Plot One

alcohol vs volatile.acidity, colored by alcohol, faceted by quality

Description One

This plot makes it easy to see the distribution of quality scores (most are in the middle), the rightward trend of alcohol content, the downward slope of volatile.acidity, and the mid-range sweet-spot of sulphates levels all in relation to quality scoring.

Plot Two

quality vs alcohol, colored by volatile.acidity, with a smoothing line

Description Two

Although alcohol and quality are swapped from their perhaps expected axis locations, the swapping, along with the smoothing line, makes it clear that as quality increases, so does alcohol content (and vice versa). Volatile.acidity continues to play a supporting role.

Plot Three

alcohol vs volatile.acidity with only clusters colored by quality

Description Three

Perhaps my favorite plot, borrowing from the experimentation with contours earlier, this plot, although leaving out sulphates, makes it clear that there are distinct clusters of quality scores that are quite obviously related to volatile.acidity and alcohol levels. If I were given a new red wine with only those two features listed, I would be very confident using merely this plot to predict the quality score (assuming the same wine experts responsible for this data set).


Reflection

A fruitful exercise, this project exposed two or three features of red wines that, when related to one another, seem to lead to obvious groupings. Alcohol, volatile.acidity, and sulphates (in that order) appear to affect the (perceived) quality of red wines, at least among those wine experts consulted in the making of this data set.

Figuring out how to use R was the biggest difficulty for me with this project. Now that I have some interesting preliminary results, it would be good to collect a new sample of data to see whether similar results can be found again.